Author: Philip Reschke (http://www.philipreschke.com).
Project: https://github.com/PhilipReschke/TensorFlow-Code-Examples
I will build a Recurrent Neural Network using a number of LSTM layers to predict whether a movie review is positive or negative. I will use an embedding layer instead of one-hot encoding all my inputs, as one-hot encoding is computationally inefficient when we have 70,000+ words. This is a modified version of the RNN sentiment example that is part of the Udacity Deep Learning Foundation class.
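To illustrate why one-hot encoding is wasteful at this vocabulary size, here is a minimal sketch. The vocabulary size, word index, and embedding size below are made up for illustration; the real values are computed later in the notebook.
import numpy as np

vocab_size = 70000   # hypothetical vocabulary size
embed_size = 300     # hypothetical embedding dimension
word_index = 42      # hypothetical integer id of one word

# One-hot encoding: every word becomes a 70,000-dimensional vector that is
# almost entirely zeros, so a 250-word review already needs a 250 x 70,000 matrix.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1

# Embedding layer: the integer id simply indexes a row of a much smaller
# (vocab_size x embed_size) weight matrix, giving a dense 300-dimensional vector.
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embed_size))
dense_vector = embedding_matrix[word_index]   # shape: (300,)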
In [1]:
import tensorflow as tf
import numpy as np
As for the raw movie reviews and their positive/negative classification, I will be using the 25,000 reviews available on the Udacity Deep Learning GitHub page @ https://github.com/udacity/deep-learning/blob/master/sentiment_network/.
In [2]:
with open('data/reviews.txt', 'r') as raw_reviews:
    reviews = raw_reviews.read()
with open('data/labels.txt', 'r') as raw_labels:
    labels = raw_labels.read()
Our reviews data looks like this:
In [3]:
reviews[:1000]
Out[3]:
As we can see from the above example, the data is not very tidy: it contains line breaks such as '\n', which actually mark the start of a new review. Using a few tidying rules, I will be able to clean up the raw text so that we will have less noise when training our model.
In [4]:
# Removing periods and 'br' line break tags
reviews_complete_text = ''.join([char for char in reviews if char != '.'])
reviews_complete_text = reviews_complete_text.replace(' br br ', ' ')
# Splitting the reviews string into a list of reviews
reviews_list = reviews_complete_text.split('\n')
The reviews are now separated from each other and stored in separate list objects. Here are the first two reviews:
In [5]:
reviews_list[0:2]
Out[5]:
We will also need to build up a vocabulary of all the words in our reviews dataset, so let's do that now:
In [6]:
text_in_reviews = ' '.join(reviews_list)
words_in_reviews = text_in_reviews.split()
In [7]:
words_in_reviews[:10]
Out[7]:
In [8]:
from collections import Counter
# Create word counter and sort by the number of occurrences of each word in descending order
word_counts = Counter(words_in_reviews)
vocabulary = sorted(word_counts, key=word_counts.get, reverse=True)
# Create a dictionary mapping each word to an integer, starting at 1 so that 0 stays free for padding
vocabulary_to_int = {word: i for i, word in enumerate(vocabulary, 1)}
# Create empty reviews list
reviews_int = []
for each in reviews_list:
    reviews_int.append([vocabulary_to_int[word] for word in each.split()])
Our reviews now appear as integers. Here are the first two reviews as integers and raw text. Nice!
In [9]:
np.array(reviews_int)[0:2], reviews_list[0:2]
Out[9]:
In [10]:
labels_split = labels.split('\n')
labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
Our labels are now encoded as 1 and 0. See here:
In [11]:
np.array(labels_split[0:20]), labels[0:20]
Out[11]:
In [12]:
# Review length counter
review_length = Counter([len(review) for review in reviews_int])
# Reviews with 0 length and longest review
review_length[0], max(review_length)
Out[12]:
Let's first remove the reviews of zero length by creating an index:
In [13]:
# Index of reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_int) if len(review) != 0]
In [14]:
# Remove zero length reviews from reviews and labels
reviews_int = [reviews_int[ii] for ii in non_zero_idx]
labels = np.array([labels[ii] for ii in non_zero_idx])
# Check labels and reviews length
len(reviews_int), len(labels)
Out[14]:
Now that the empty review is gone, let's reduce the length of each review to a fixed length of 250 words and pad with 0 where reviews are shorter than 250 words.
While we are at it, we might as well create our input array, which will be N by M, where N is the number of reviews in our dataset and M is our desired review length. Hence, each row is a single review brought to the desired length.
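As a small illustration of the padding scheme used below (toy numbers, not from the dataset), a 3-word review padded to a target length of 6 ends up left-padded with zeros:
import numpy as np

toy_review = [11, 7, 2]          # hypothetical word ids for a short review
toy_seq_len = 6                  # toy target length
padded = np.zeros(toy_seq_len, dtype=int)
padded[-len(toy_review):] = toy_review
print(padded)                    # [ 0  0  0 11  7  2]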
In [15]:
# Desired review length - we cut off the rest of the review
seq_len = 250
# Creating a zero matrix of dimensions N by M
features = np.zeros((len(reviews_int), seq_len), dtype=int)
# Filling in the reviews
for i, row in enumerate(reviews_int):
    features[i, -len(row):] = np.array(row)[:seq_len]
We now have our features ready for training:
In [16]:
features[:2,:-100]
Out[16]:
In [17]:
features.shape
Out[17]:
My reviews and labels data is now cleaned up as much as required to demonstrate how an RNN works with TensorFlow, so it's time to split it into a training, validation, and testing set. Let's set aside 80% for training and 10% each for validation and testing.
In [18]:
# Split fraction
training_fraction = 0.8
# Index to split the dataset at for training and validation
training_idx = int(len(features) * training_fraction)
# Splitting the dataset for train and val
train_x, val_x = features[:training_idx], features[training_idx:]
train_y, val_y = labels[:training_idx], labels[training_idx:]
# Index to split the dataset at for validation and testing
validation_idx = int(len(val_x) * 0.5)
# Splitting the dataset for val and testing
val_x, test_x = val_x[:validation_idx], val_x[validation_idx:]
val_y, test_y = val_y[:validation_idx], val_y[validation_idx:]
Our dataset is now split into the following training, validation and testing set:
In [19]:
print("Training: \t\t{}".format(train_x.shape),
"\nValidation: \t\t{}".format(val_x.shape),
"\nTesting: \t\t{}".format(test_x.shape))
In [20]:
lstm_size = 512 # Number of units in the hidden layer of each LSTM cell
lstm_layers = 2 # Number of LSTM layers in the network
batch_size = 500 # Number of reviews, out of the 20,000 in our training set, that we feed into the network in one go
learning_rate = 0.005 # Our learning rate for use in the Adam optimizer
In [21]:
n_words = len(vocabulary_to_int) + 1  # Add 1 because word indices start at 1 and 0 is reserved for padding
# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
In [22]:
n_words
Out[22]:
In [23]:
# Number of units in the embedding layer
embed_size = 300
with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
In [24]:
with graph.as_default():
    # Your basic LSTM cell
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    # Add dropout to the cell
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)
In [25]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)
In [26]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
In [27]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
In [28]:
def get_batches(x, y, batch_size=100):
    # Calculate number of batches
    n_batches = len(x)//batch_size
    # Obtain only full batches
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
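As a quick sanity check of the generator above (a hypothetical usage example, assuming the 250-word features and 500-review batch size defined earlier), the first batch should come out as 500 rows of 250 word ids plus 500 labels:
# Hypothetical sanity check: pull one batch and inspect its shape
for x_batch, y_batch in get_batches(train_x, train_y, batch_size=500):
    print(x_batch.shape, y_batch.shape)   # expected: (500, 250) (500,)
    break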
In [29]:
# If the checkpoints directory doesn't exist:
!mkdir checkpoints
In [30]:
epochs = 5

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)

        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            if iteration % 5 == 0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration % 25 == 0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration += 1
    saver.save(sess, "checkpoints/sentiment.ckpt")
In [31]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))